Abstract

The origin of long-range letter correlations in natural texts is studied using random
walk analysis and Jensen-Shannon divergence. It is concluded that they result from slow
variations in letter frequency distribution, which are a consequence of slow variations
in lexical composition within the text. These correlations are preserved by random
letter shuffling within a moving window. As such, they do reflect structural properties
of the text, but in a very indirect manner.

1 Introduction

Statistical properties of numerical and symbolic sequences derived from naturally
occurring phenomena are of interest in many different areas. To name just a few examples,
human language texts [1, 2, 3, 4, 5, 6], music [7], DNA sequences [8, 9], and heartbeat
recordings [10] have been subject to such examinations. There appears to be a common
theme in these studies: the sequences in question are certainly not "random" in some
sense, they are produced by "complex systems" (we use quotes here to convey an
intuitive, non-terminological status of these statements), and so their statistical
properties should depart from those of random sequences, thus revealing the regularities.
One of the hopes is that we can find some general characteristics of information-bearing
sequences. If this is possible, not only would we achieve new understanding of the
systems in question, but also it would be possible to apply the analysis to systems of
uncertain status, such as the non-coding regions of DNA, to determine whether they
carry information or not.

In this work we take a closer look at one particular area where natural-language 
texts were found to depart from randomness: long-range letter correlation. Various 
authors suggested that such correlations, observed on distances of 10^-10^ characters, 
are indicative of stylistic and conceptual (semantic) coherence of the text jBj, or, more 
cautiously, "are of structural origin" [1]. We will demonstrate that (1) the long-range
letter correlation arises from slow changes in letter distribution along the text, (2) 
which in turn result from slow changes in lexical composition, (3) the primary role 
being played by the more frequent words.

2 Random walk transformation

A popular method proposed in [9] for assessing the degree of randomness in a numerical
sequence $\{x_i\}$, $0 < i \le N$, as it is usually presented, is to consider its members
as sequential steps of a one-dimensional random walk and calculate the mean-square
displacement as a function of time interval:

$$ y_{i,k} = \sum_{j=i}^{i+k} x_j \tag{1} $$

$$ F(k) = \langle y_{i,k}^2 \rangle_i - \langle y_{i,k} \rangle_i^2 \tag{2} $$

where the angle brackets $\langle \cdots \rangle_i$ denote the average over all initial
positions $i$ in the sequence, and $k$ is the interval length, assumed to be much shorter
than the total sequence length, $k \ll N$.

Or, if we subtract the mean from the data, $\xi_i = x_i - \langle x \rangle$, then $F(k)$
becomes the mean-square of the partial sums of the resulting sequence,

$$ F(k) = \langle S_{i,k}^2 \rangle_i, \qquad S_{i,k} = \sum_{j=i}^{i+k} \xi_j \tag{3} $$

If each $x_i$ results from an independent trial of a random variable with variance
$\sigma^2$, $F(k) = k\sigma^2$ and thus grows linearly with $k$. If, on the other hand,
there are correlations in the sequence, i.e. some averages
$\langle \xi_i \xi_{i+k} \rangle_i$ do not vanish, the growth of $F(k)$ may depart from
linearity. Generally speaking, power-law growth

$$ F(k) \sim k^{\alpha} \tag{4} $$

may indicate fractal structure of some sort in the data sequence. The quantity $\alpha$
is the Hölder (or Hurst) exponent of order 2.
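
The definition of the displacement function can be sketched in a few lines of Python.
This is an illustrative sketch, not the procedure used to produce the figures: the
function name `displacement` is ours, each window is taken to contain exactly k steps
(so that independent steps give a displacement of k times the step variance), and the
test sequence is synthetic.

```python
import random
from itertools import accumulate

def displacement(x, k):
    """F(k): variance, over all start positions, of the sum of k consecutive steps."""
    c = [0.0] + list(accumulate(x))                  # c[m] = x[0] + ... + x[m-1]
    y = [c[i + k] - c[i] for i in range(len(x) - k + 1)]
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y) / len(y)  # <y^2> - <y>^2

# Synthetic check (assumption: independent 0/1 steps with P(1) = p, variance p(1 - p)).
random.seed(0)
p = 0.08
x = [1 if random.random() < p else 0 for _ in range(50_000)]
for k in (10, 100, 1000):
    # Ratios near 1 indicate linear growth, i.e. no correlations.
    print(k, displacement(x, k) / (k * p * (1 - p)))
```

The cumulative sum makes each window sum a single subtraction, so the cost per value of
k is linear in the sequence length.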

There are many possible ways to convert a natural text to a numerical sequence
in order to calculate $F(k)$. One can use a binary representation of characters and
consider consecutive bit values in it [2], or assign a numerical value to each letter
[1]. One can also work on the level of words and replace each word by its frequency
rank [6], or build a binary sequence where 1 (resp., 0) corresponds to the transition
to a longer (resp., shorter) word or to a more frequent (resp., less frequent) word [3].
Regardless of the method used, the cited authors found departures from the linear
growth of the displacement function $F(k)$. We will loosely follow the method of [1]
here to demonstrate the result. We use one of the texts analyzed in that work, Herman
Melville's magnum opus Moby Dick, which has a respectable volume of about
$1.2 \cdot 10^6$ letters. For comparison, we also use Dickens' David Copperfield with
over $2 \cdot 10^6$ letters. Before processing, the texts were converted to lowercase and
non-alphabetic character sequences were collapsed to single spaces (i.e., in regular
expression terms, s/[^a-z]+/ /g). The resulting character sequence is the subject of all
further analysis.
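
The preprocessing step can be reproduced with Python's standard `re` module. This is a
minimal sketch that assumes the intended substitution is s/[^a-z]+/ /g applied after
lowercasing; the function name `normalize` is ours.

```python
import re

def normalize(text):
    """Lowercase the text and collapse every run of non-letters into one space."""
    return re.sub(r"[^a-z]+", " ", text.lower())

print(normalize("Call me Ishmael. Some years ago--never mind how long"))
# call me ishmael some years ago never mind how long
```

Note that under this rule accented letters and digits are treated as separators, like
punctuation.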

To obtain a numerical sequence from the text for the random walk analysis, following
[1], we select a letter and convert all instances of that letter to ones, and all other
characters to zeros. Fig. 1 shows $F(k)$ for three letters, 'a', 'v', and 'x',
representative of high, middle, and low frequency characters.

[Figure 1 about here.]

In the range $10 < k < 600000$, where end effects can be neglected, there are
three more or less distinct regions in the chart in Fig. 1. Roughly between 10 and 200
characters, $F(k)$ exhibits linear growth, indicating lack of correlations. Between 200
and 1200 characters (for 'a'), it is consistent with a power-law growth with exponent
$\alpha \approx 1.2$ (consistent with the value reported in [1]). Above that, $F(k)$
seems to return to linear growth. Note the strikingly similar behavior of all three
letters. It should be noted that the authors of [1] studied the displacement function
averaged over all letters. It is interesting, though, that long-range correlations are
revealed even in the sequences obtained from individual letters.
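
One way to make the three regions quantitative is to look at the local slope of the
displacement function on a log-log scale. The sketch below is only an illustration: the
helper `local_exponents` is a hypothetical name, and the sample values are synthetic,
constructed to mimic the qualitative shape described here rather than taken from the
actual data.

```python
from math import log

def local_exponents(ks, Fs):
    """Slope of log F versus log k between consecutive sample points."""
    return [log(F2 / F1) / log(k2 / k1)
            for (k1, F1), (k2, F2) in zip(zip(ks, Fs), zip(ks[1:], Fs[1:]))]

# Synthetic F(k), made up for illustration: linear up to k = 200,
# exponent 1.2 between 200 and 1200, linear again above 1200.
ks = [10, 200, 1200, 12000]
Fs = [10.0, 200.0, 200.0 * 6 ** 1.2, 200.0 * 6 ** 1.2 * 10]
print([round(a, 2) for a in local_exponents(ks, Fs)])   # [1.0, 1.2, 1.0]
```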

This result of [1] was corroborated there with other measures (see also [4], where
the detrended fluctuation analysis technique was applied to Moby Dick). The question
we are concerned with here is what this result actually means. Ebeling et al. [1]
demonstrated that if the text is randomly shuffled, whether on the level of letters,
whole words, or complete sentences, the correlations are destroyed, and $F(k)$ returns
to linear growth in the entire valid range of $k$. This demonstrates that neither
intra-word letter correlations, nor intra-sentence syntactic and semantic relations are
responsible for the observed large-scale behavior. What is, then?

3 Slow distribution changes 

The sentence-level shuffling of the text preserves the syntax and semantics of the
language, but destroys the overall narrative with its plot and composition. Since it also
destroys long-range correlation, it could be tempting to conclude that the correlation
is a direct consequence of the narrative structure. It is easy, however, to disprove
this notion by shuffling the text within a moving window. Namely, consider
the original text as a sequence of characters $T = \{c_j\}$ and derive from it a character
sequence $T'$, where the $i$-th position is occupied by a character randomly selected from
$\{c_j \mid i - n/2 < j < i + n/2\}$, where $n$ is the window size. This sequence
preserves the overall letter distribution of $T$ and, in addition, any slow changes in
this distribution, but completely destroys everything else; $T'$ is not a natural language
text, and $T$ cannot be reconstructed from it. Fig. 2 compares the behavior of $F(k)$ for
the original text and for the sequence shuffled with window $n = 3000$. Clearly, all the
features of the random walk are preserved by the shuffling. With increasing window size,
as expected, the linear region extends to the right, and eventually long-range
correlations disappear altogether as window size exceeds the maximum correlation length.
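
The moving-window shuffle can be sketched as follows. This is one possible reading of
the construction, not necessarily the exact procedure used for Fig. 2: the function name
`window_shuffle` is ours, the window is taken to include both end points i - n/2 and
i + n/2 (integer division), it is truncated at the ends of the text, and characters are
drawn with replacement, so the local letter distribution is preserved only on average.

```python
import random

def window_shuffle(text, n, seed=0):
    """Replace position i by a character drawn at random from a window centred on i."""
    rng = random.Random(seed)
    half = n // 2
    last = len(text) - 1
    out = []
    for i in range(len(text)):
        lo = max(0, i - half)          # window is truncated at the text boundaries
        hi = min(last, i + half)
        out.append(text[rng.randint(lo, hi)])
    return "".join(out)

sample = "a" * 50 + "b" * 50
print(window_shuffle(sample, 10))      # letters mix only near the a/b boundary
```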

[Figure 2 about here.] 

This leaves little room for speculation: obviously, the specific behavior of the dis- 
placement function is solely a result of bulk letter distribution in the text, unrelated 
to any structural features that distinguish an arbitrary character sequence from a text 
in a natural language. 

To further demonstrate this, we generate another character sequence which has no
relationship to the Moby Dick text, and results from a random process that generates
the letter 'a' with probability $p$ and a blank space character with probability $1 - p$,
where $p = 0.062$ except in a short range of length 6250, where $p = 0.1054$ (these
parameters were selected to obtain the desired qualitative behavior of the displacement
function; they have no significance beyond that). Again, the displacement function
exhibits the same qualitative features as for Moby Dick, as shown in Fig. 3. Of course,
the distribution of the letter 'a' in Moby Dick is very different, but the point is that
$F(k)$ does not reveal the difference.
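
A generator for such an artificial sequence might look like the sketch below. The two
probabilities and the length of the special range come from the text; the total length
and the position of the range are not stated, so the values of `length` and
`burst_start` used here are assumptions chosen only for illustration, as is the function
name.

```python
import random

def artificial_sequence(length, burst_start, burst_len=6250,
                        p=0.062, p_burst=0.1054, seed=0):
    """Emit 'a' with probability p (p_burst inside the special range), else a space."""
    rng = random.Random(seed)
    out = []
    for i in range(length):
        inside = burst_start <= i < burst_start + burst_len
        out.append("a" if rng.random() < (p_burst if inside else p) else " ")
    return "".join(out)

# Hypothetical total length and range position.
seq = artificial_sequence(length=100_000, burst_start=40_000)
inside = seq[40_000:46_250].count("a") / 6250
outside = (seq.count("a") - seq[40_000:46_250].count("a")) / (100_000 - 6250)
print(inside, outside)   # frequencies of 'a' inside and outside the special range
```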

[Figure 3 about here.] 

We can conclude that the behavior of the displacement function $F(k)$ results from
slow changes in bulk letter distribution in the text on the scales of the order 10^-10^ 
characters. But why would the letter distribution be changing at all? It would seem a 
priori that it should be a rather stable feature of a given language, or at the very least, 
a given language subset. For example, children's books may have a lower frequency 
of such letters as 'q' or 'x', which in English appear mostly in the "long" words of 
Latin origin. But there is no apparent reason why the frequency of such a neutral and
common letter as 'a' should be subject to slow fluctuations.

To investigate this issue, we turn to a different tool. 

4 Jensen-Shannon information divergence

We want to compare letter distributions in different segments of the text. A convenient
measure for this is provided by Jensen-Shannon information divergence (JSD). Let
$p = \{p_i\}$ and $q = \{q_i\}$, $1 \le i \le n$, be two frequency distributions of the
same dimensionality $n$, normalized so that $\sum_i p_i = \sum_i q_i = 1$. Define

$$ D(p, q) = H((p + q)/2) - (H(p) + H(q))/2 \tag{5} $$

where $H$ is the entropy of the distribution

$$ H(p) = -\sum_i p_i \log p_i \tag{6} $$

and $(p + q)/2$ is a shorthand for the distribution $r$, such that
$r_i = (p_i + q_i)/2$. (If $p$ and $q$ are determined from different numbers of trials,
they should be weighted with corresponding coefficients in (5), but we don't need this
generalization here.) This measure is related to the mutual information between the two
distributions, and vanishes iff they are identical.
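
The definition translates directly into code. The following is a minimal sketch with
natural logarithms and equal weights, as in the definition; the function names are ours,
and zero entries are skipped under the usual convention that 0 log 0 = 0.

```python
from math import log

def entropy(p):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(x * log(x) for x in p if x > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two distributions of equal length."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2

print(jsd([0.5, 0.5], [0.5, 0.5]))   # identical distributions: 0.0
print(jsd([1.0, 0.0], [0.0, 1.0]))   # disjoint distributions: log 2
```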

JSD was applied in [8] to DNA sequences and in [7] to texts and music for the purpose
of segmentation, i.e. splitting a sequence into parts maximizing the difference in
composition (whether in terms of "letters", "words", "keywords", etc.). Here we have
a different application in mind. We want to find out whether the letter distribution
undergoes statistically significant changes along the text. To this end, we will compare
two adjacent, equal-length regions of the text, and we need to determine whether
the two observed frequency distributions in them are likely to result from the same
underlying probability distribution. Consequently, we need to calculate the fluctuation
level, i.e. the expected JSD between two realizations of the same probability
distribution. General statistical properties of JSD were obtained by Grosse et al. [8],
and we'll briefly reproduce the derivation for the particular case at hand.

Let $p$ be the probability distribution and $q$ the observed frequency distribution
obtained from $N \gg n$ trials. The relative variance of each $q_i$ is then approximately
$\sigma_i^2 = 1/(p_i N)$. Assuming that it is small, $\sigma_i \ll 1$, we can represent
$q_i = p_i(1 + \epsilon_i)$, $\epsilon_i = O((p_i N)^{-1/2})$, and estimate for each term
of the sum in (5)

$$ D_i(p, q) = \frac{1}{2} \left( p_i \log p_i + q_i \log q_i - (p_i + q_i) \log \frac{p_i + q_i}{2} \right) \tag{7} $$

$$ = \frac{p_i}{2} \left( (1 + \epsilon_i) \log(1 + \epsilon_i) - (2 + \epsilon_i) \log(1 + \epsilon_i/2) \right) \tag{8} $$

$$ \approx \frac{1}{8} p_i \epsilon_i^2 \tag{9} $$

$$ = O\left( \frac{1}{8N} \right) \tag{10} $$

where the first two terms in the Taylor expansion of $\log(1 + x)$ were used (assuming
natural logarithms to simplify the expressions). Since the deviation here is
unidirectional, i.e. JSD cannot be negative, the estimate for the sum in (5) is to be
multiplied by $n - 1$, the number of degrees of freedom. Finally, if both $p$ and $q$ are
realizations of an unknown probability distribution, this adds another factor of 2, and
we arrive at

$$ \text{Fluctuation level of } D(p, q) = \frac{n - 1}{4N} \tag{11} $$

where, again, $N$ is the total number of trials in each of $p$, $q$, and $n$ is the number
of possible outcomes.^1
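
The fluctuation level (n - 1)/(4N) can be checked numerically. The sketch below is a
small Monte Carlo experiment under assumptions of our own choosing: 27 outcomes with
mildly non-uniform weights, N = 2000 trials per sample, and 200 repetitions. All function
names are hypothetical, and the parameters have no significance beyond illustration.

```python
import random
from math import log

def entropy(p):
    return -sum(x * log(x) for x in p if x > 0)

def jsd(p, q):
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2

def sample_freqs(weights, N, rng):
    """Observed frequencies from N independent draws with the given weights."""
    counts = [0] * len(weights)
    for i in rng.choices(range(len(weights)), weights=weights, k=N):
        counts[i] += 1
    return [c / N for c in counts]

def mean_jsd(weights, N, reps, seed=0):
    """Average JSD between two independent samples of the same distribution."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        total += jsd(sample_freqs(weights, N, rng), sample_freqs(weights, N, rng))
    return total / reps

n, N = 27, 2000                              # assumed alphabet size and sample size
weights = [1 + i % 3 for i in range(n)]      # arbitrary non-uniform weights
simulated = mean_jsd(weights, N, reps=200)
predicted = (n - 1) / (4 * N)
print(simulated / predicted)                 # expected to be close to 1
```

Agreement is expected only while every outcome is observed many times, in line with the
small-variance assumption of the derivation.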

[Figure 4 about here.] 

Fig. 4 shows how JSD between adjacent segments of length $n$ varies along the text
for two window sizes, $n = 1000$ and $n = 100000$ characters (intermediate sizes not shown
to avoid clutter; here $n$ denotes the segment length, which plays the role of $N$ in
(11)). For the shortest window size, JSD is at the fluctuation level (which confirms the
estimate (11) as a side effect). With larger window size, however, systematic variations
in letter composition stand out from the decreasing statistical noise and become
significant. It may be interesting to see whether the peaks in the figure
match some compositionally meaningful locations in the text, but for the purposes
of this work what's important is that JSD is comfortably above the fluctuation level
practically everywhere, albeit in some places more so than in others.
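
A profile of this kind can be computed along the following lines. This is a sketch under
assumptions of our own: the text is already normalized to lowercase letters and single
spaces, the alphabet has 27 symbols (the space is counted as an outcome), and adjacent
segments are taken on a non-overlapping grid rather than at every position. The names
`jsd_from_counts` and `jsd_profile` are hypothetical.

```python
from collections import Counter
from math import log

ALPHABET = "abcdefghijklmnopqrstuvwxyz "     # assumption: space counted as a symbol

def jsd_from_counts(c1, c2, N):
    """JSD of two equal-size samples, as the mean of their KL divergences to the mixture."""
    d = 0.0
    for ch in ALPHABET:
        p, q = c1[ch] / N, c2[ch] / N
        m = (p + q) / 2
        for v in (p, q):
            if v > 0:
                d += 0.5 * v * log(v / m)
    return d

def jsd_profile(text, seg_len):
    """JSD between each pair of adjacent segments, divided by (n - 1) / (4 N)."""
    level = (len(ALPHABET) - 1) / (4 * seg_len)
    segs = [Counter(text[i:i + seg_len])
            for i in range(0, len(text) - seg_len + 1, seg_len)]
    return [jsd_from_counts(a, b, seg_len) / level for a, b in zip(segs, segs[1:])]

toy = "ab" * 500 + "cd" * 500                # toy input with one abrupt change
print(jsd_profile(toy, 500))                 # only the middle pair differs
```

Values near 1 mean the two segments are statistically indistinguishable; values well
above 1 signal a real change in letter composition.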

Obviously, in a natural text, letters come in packages (words), and any changes
in letter composition along the text must result from changes in lexical composition.
It is well known that words in the language are distributed in a highly skewed fashion,
with many instances of a small number of frequent word types and an increasingly larger
number of rare word types. The distribution is approximately described by Zipf's law

$$ f_k \sim 1/k \tag{12} $$

where $f_k$ is the frequency of the word with rank $k$, and the rank is the word's
sequence number in a dictionary where words are ordered by decreasing frequency. The top
positions in such a dictionary are occupied by grammatical words (articles, prepositions,
personal pronouns, conjunctions, etc.) and high-frequency significant words (nouns
like man, adjectives like old), which are common for all texts, and by select "content
words" peculiar to a particular text (ship, Ahab, whale in Moby Dick). The top of the
dictionary is relatively stable, while the rest of it is much more subject to changes from
text to text and within texts, depending on style, topic, etc. It is not clear a priori
which part(s) of the lexicon are responsible for the changes in letter composition of
the text: the less frequent words are, generally speaking, more variable, but because
there are many more of them, the law of large numbers should ensure a more random
mixing of the letters; the more frequent words are less variable, but any change in their
distribution would have a larger impact, because there is a small number of frequent
word types. In the next section we focus on this question.

^1 Interestingly, the probabilities themselves do not enter this estimate at all. This is
in apparent contradiction with the fact that a pdf with some number of possible outcomes
$n$ and $p_n = 0$ is completely equivalent to a pdf with $n - 1$ possible outcomes, while
the estimate (11) will be different for them. However, the estimate is not valid when some
$p_i$ tends to zero, because this violates the assumption of small $\sigma_i^2$.

5 Lexical composition and its impact on letter distribution

To investigate the effect of different parts of the vocabulary, we applied the analysis
of the previous section separately to the words in different frequency ranges. The
frequency dictionary of the text was subdivided into 5 ranges so that the words in each
range are responsible for 20% of the total number of letters. For each range, the
rest of the words were blanked out, and JSD between adjacent 100000-letter segments
was calculated (blanks were not counted). Fig. 5 shows the average JSD normalized by
the fluctuation level for Moby Dick and David Copperfield. Interestingly, it is somewhat
above the fluctuation level even for the most infrequent words, but the biggest
contribution in both cases is due to words that are close to the top of the dictionary,
but not the most frequent ones. It is still a relatively small number of word types (135
word types for Moby Dick and 80 word types for David Copperfield). Most names of the major
characters fall into this frequency range. It is a more idiosyncratic set of words than
the top of the lexicon, but it is still limited enough that the letters are not well mixed
according to probabilities.
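
The partition of the frequency dictionary can be sketched as follows. This is our
reading of the procedure, with several assumptions: the text is already normalized to
single spaces, ties in frequency are broken alphabetically, a word is assigned to a band
by the cumulative letter count that precedes it in the ranking (so the bands hold only
approximately equal shares of letters), and blanked words are replaced by spaces of the
same length. The function names are hypothetical.

```python
from collections import Counter

def frequency_bands(text, bands=5):
    """Map each word type to a band index; each band holds about 1/bands of all letters."""
    freq = Counter(text.split())
    ranked = sorted(freq, key=lambda w: (-freq[w], w))   # rank by decreasing frequency
    total = sum(len(w) * c for w, c in freq.items())     # letters only, spaces excluded
    band_of, cum = {}, 0
    for w in ranked:
        band_of[w] = min(bands - 1, cum * bands // total)
        cum += len(w) * freq[w]
    return band_of

def keep_band(text, band_of, band):
    """Blank out every word that does not belong to the requested band."""
    return " ".join(w if band_of[w] == band else " " * len(w) for w in text.split())

toy = "the the the the whale sea"
bands = frequency_bands(toy, bands=2)
print(bands)
print(repr(keep_band(toy, bands, 0)))
```

Keeping the blanked words as spaces preserves character positions, so the segment grid
is the same for every band; the spaces are then simply left out of the letter counts.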

[Figure 5 about here.] 

It is easy to see qualitatively why the slowly changing composition of the top $\sim 10^2$
words will lead to corresponding changes in letter distribution and long-range letter 
correlation. As a simple model, suppose that 100 "content" words are responsible for 
20% of all letters in a 10^-letter text segment. These 100 words are selected from the 
lexicon of the language, which, as a whole, is characterized by some letter frequency 
distribution pi. The content words are selected by the writer according to the topic 
and style, but the resulting selection of letters is essentially random (except for very 
rare cases of highly alliterated prose). However, due to the small number of "trials", it
will have a considerable variance. For example, the average frequency of the letter 'a' in
English is about 10%, hence out of the $100 \cdot 4.5 = 450$ letters in the 100 content
words there will be about 45 'a's. Depending on which 100 words are chosen, the expected
standard deviation is on the order of $\sqrt{45} \approx 7$, i.e. as much as 15%. Even if
the variance in the remaining 80% of the text is negligible, the frequency of 'a's will
fluctuate much more strongly than for a Poissonian process on the characteristic lengths
where the "content words" are stable.
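
The arithmetic behind this estimate can be written out explicitly. This is a sketch
under the assumptions of the simple model (100 content words, 4.5 letters per word on
average, a 10% frequency of 'a', Poisson-like counting statistics, and roughly equal use
of the content words); the symbols $n_a$ and $L$ are introduced here only for
illustration.

```latex
\begin{align*}
\langle n_a \rangle &\approx 0.10 \times (100 \times 4.5) = 45, \\
\delta n_a &\sim \sqrt{45} \approx 6.7, \\
\delta n_a / \langle n_a \rangle &\approx 6.7 / 45 \approx 0.15, \\
\text{relative shift in the overall frequency of 'a'} &\approx 0.20 \times 0.15 = 0.03 .
\end{align*}
```

For comparison, independent letter placement in a segment of $L$ letters would give a
relative fluctuation of about $1/\sqrt{0.1 L}$, which falls below 3% once $L$ exceeds
roughly $10^4$ letters.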

6 Discussion 

From the analysis we presented in this paper, it follows that long-range letter
correlation in natural texts results from the interplay of the following factors:

1. a significant portion of the letters in texts is contributed by a relatively small 
class of "content" words with high frequency and high variability in the text; 

2. slow variation in the composition of the "content" words causes corresponding 
slow variation in the letter distribution; 

3. this translates to long-range correlation between letters, which is invariant with
respect to letter shuffling within a sliding window of length 3000.

The variation in lexicon may reflect various properties of the text. For example, here 
are some of the differences observed between the first and the second halves of Moby 
Dick: 

1. increased frequency of the word whale in the second half reflects topical
differences;

2. the word is is more frequent than was in the first half, but less frequent in the 
second half, reflecting the difference in narrative structure; 

3. the ratio of articles the to a increases from 2.7 in the first half to 3.5 in the 
second half, which may, for example, indicate the trend from general statements 
to concrete narrative. 

The long-range letter correlations can serve as an indirect and indiscriminate indicator 
of slow variations of character frequency distribution. In natural texts, these variations 
result from the corresponding slow variations of lexical composition, which in turn 
reflect various structural properties of the text. However, in the case of symbolic and
numerical sequences of a different origin, such variations in and of themselves do not
necessarily indicate "complexity" or information-bearing nature.

Figure 1: Displacement function for individual letters in Moby Dick.

Figure 2: Displacement function for a window-shuffled text of Moby Dick.

Figure 3: Displacement function for an artificial character sequence (see text).

Figure 4: Jensen-Shannon divergence between letter distributions in adjacent segments of
length n from Moby Dick, normalized by fluctuation level.

Figure 5: Average JSD between two adjacent subsequences of length $10^5$ characters for
Moby Dick and David Copperfield, normalized by fluctuation level, calculated with words
from 5 frequency ranges.